The purpose of this project is to predict the residential house price in Ames, Iowa. We basically intent to create linear model which can predict house prices in Iowa based on different predictor variables. This is basically how companies like Redfin/Zillow gives their zestimate in real life.
Our final project will try to touch some of the concepts mentioned below:
The dataset is provided by Dean De Cock from Truman State University and our initiate is Inspired by Kaggle problem https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview.
The data consists of 2930 observations and 82 variables. This are mix of nominal , discrete , ordinal and conitnous variables.
This is defintely a huge dataset and we will try to subset the data and work with a set of 25 predictors which we feel are relevant as a part of our study. Also we will only use residential properties in our analysis.
We loaded housing_data into R.
Some of the important variable looking at the dataset are.
MS Zoning: Identifies the general zoning classification of the sale.Lot Area: Lot size in square feet.Overall Qual: Overall material and finish quality.Total Bsmt SF: Total square feet of basement area.Year Built: Original construction date.Bedroom AbvGr: Bedrooms above grade (does NOT include basement bedrooms).Bsmt Full Bath: Full bathrooms above grade.GarageCars: Size of garage in car capacity.Street: Type of road access to property#Install all packages
library(readr)
library(stringr)
library(ggplot2)
library(dplyr)
library(corrplot)
library(tidyverse)
library("VIM")
library(car)
library(lmtest)
library(caret)
library(rpart)
library(rpart.plot)
library(PerformanceAnalytics)
library(trafo)
#install.packages("stringr")
#install.packages("corrplot")
#install.packages("tidyverse")
#install.packages("car")
housing_data = read_csv("housing_data.csv")
names(housing_data) = str_replace_all(names(housing_data), c(" " = "." , "," = "" ))
housing_data[c(1:10),c(1:5)]
## # A tibble: 10 x 5
## Order PID MS.SubClass MS.Zoning Lot.Frontage
## <dbl> <dbl> <dbl> <chr> <dbl>
## 1 1 526301100 20 RL 141
## 2 2 526350040 20 RH 80
## 3 3 526351010 20 RL 81
## 4 4 526353030 20 RL 93
## 5 5 527105010 60 RL 74
## 6 6 527105030 60 RL 78
## 7 7 527127150 120 RL 41
## 8 8 527145080 120 RL 43
## 9 9 527146030 120 RL 39
## 10 10 527162130 60 RL 60
To start off our analysis, we will do some Data cleaning. We will check and select some predictors and create a subset rather than all variable. Also we will check for variable which have way too many ‘NA’ as it can impact our result.
sapply(housing_data, function(x) sum(is.na(x)))
Below is the graphical representation of NA values.
missing_data = housing_data[sapply(housing_data, function(x) sum(is.na(x))) > 100]
aggr(missing_data)
We see the 2 columns which have a lot of NA are Alley and Lot Frontage.
Out of this we are taking out Alley as out of 2930 records only , it has 2732 as ‘NA’ so it is not really useful. Also columns like Fireplace.Qu,Pool.QC,Fence,Misc.Feature has lot of NA values.
We can use Lot Frontage and will replace the missing values with the mean of the data.Same Applicable too Garage.Cars and Total.Bsmt.SF
NA_values=data.frame(no_of_na_values=colSums(is.na(housing_data)))
#Replace NA values with mean
housing_data$Lot.Frontage[which(is.na(housing_data$Lot.Frontage))] = mean(housing_data$Lot.Frontage,na.rm = TRUE)
housing_data$Total.Bsmt.SF[which(is.na(housing_data$Total.Bsmt.SF))] = mean(housing_data$Total.Bsmt.SF,na.rm = TRUE)
housing_data$Garage.Area[which(is.na(housing_data$Garage.Area))] = mean(housing_data$Garage.Area,na.rm = TRUE)
names(housing_data)[names(housing_data) == "Year.Remod/Add"] = "Yr.Renovated"
house_data_subset = subset(housing_data,
select = c (SalePrice,
Lot.Area,
Gr.Liv.Area,
Garage.Cars,
Total.Bsmt.SF,
Bedroom.AbvGr,
TotRms.AbvGrd,
Full.Bath,
Half.Bath,
Overall.Qual,
Overall.Cond ,
Year.Built ,
Yr.Renovated,
Fireplaces,
Bsmt.Full.Bath,
Bsmt.Half.Bath,
Lot.Frontage,
Garage.Area,
MS.Zoning,
Bsmt.Qual,
Kitchen.Qual,
Paved.Drive,
Foundation,
Central.Air,
Garage.Type
))
We have identified below predictors as factor predictors.
#Filter out and keep only Residential properties
target = c("A (agr)", "C (all)", "I (all)")
house_data_subset = filter(house_data_subset, !MS.Zoning %in% target)
#Coerce as.factor for categorical variables
house_data_subset$Paved.Drive = as.factor(house_data_subset$Paved.Drive)
house_data_subset$MS.Zoning = as.factor(house_data_subset$MS.Zoning)
house_data_subset$Foundation = as.factor(house_data_subset$Foundation)
house_data_subset$Kitchen.Qual = as.factor(house_data_subset$Kitchen.Qual)
# add NA as factor
house_data_subset$Bsmt.Qual = factor(house_data_subset$Bsmt.Qual)
house_data_subset$Bsmt.Qual = addNA(house_data_subset$Bsmt.Qual, ifany = TRUE)
house_data_subset$Garage.Type = factor(house_data_subset$Garage.Type)
house_data_subset$Garage.Type = addNA(house_data_subset$Garage.Type, ifany = TRUE)
Here as a part of our exploratory Data Analysis, we are just plotting the data to see how it looks for certain predictors.
Test_data = house_data_subset
form = as.formula(SalePrice ~ Fireplaces + Overall.Cond + Overall.Qual +
Lot.Area + Bedroom.AbvGr + Year.Built + Gr.Liv.Area +
Garage.Area )
mdl = rpart(form, data = Test_data)
rpart.plot(mdl, type = 5, clip.right.labs = FALSE, branch = .3, under = TRUE)
Let’s use pair plot to see relationship between certain variables, we will subselect only numerical variables
chart.Correlation(house_data_subset[,1:10])
Also we will check correlation between numerical variable just to get started with which models we should be relevant to us
data = na.omit(house_data_subset)
corrplot(cor(data[sapply(data, is.numeric)]), method ="ellipse",
title = "Correlation Matrix Graph",tl.cex = .5, tl.pos ="lt", tl.col ="dodgerblue" )
house_correlation_var=cor(data.frame(house_data_subset[,1:18]), use = "complete.obs")
sort(abs(house_correlation_var["SalePrice", ]), decreasing = TRUE)
## SalePrice Overall.Qual Gr.Liv.Area Garage.Cars Garage.Area
## 1.00000000 0.79741564 0.70712246 0.64634514 0.64109841
## Total.Bsmt.SF Year.Built Full.Bath Yr.Renovated TotRms.AbvGrd
## 0.63003325 0.55131301 0.54105888 0.52605103 0.49315068
## Fireplaces Lot.Frontage Half.Bath Lot.Area Bsmt.Full.Bath
## 0.46925240 0.34609586 0.28139553 0.27211261 0.27031218
## Bedroom.AbvGr Overall.Cond Bsmt.Half.Bath
## 0.13666648 0.11567870 0.03871955
High levels of correlation amongst multiple predictors is confirmed by the above corrplot.
Let’s create few models from the model selection process and perform Model diagnostics. We will check Adjusted R2 for the models and also if any of the items has collinearity issue and high VIF.
full_model = lm(SalePrice ~ . , data = house_data_subset)
selected_model_aic = step(full_model, data = house_data_subset, trace = 0)
summary(selected_model_aic)$adj.r
## [1] 0.8494521
This Model has an adjusted R2 of 0.8494521
FirstModel = lm(
SalePrice ~ Fireplaces + TotRms.AbvGrd + Total.Bsmt.SF + Overall.Cond + Overall.Qual + Lot.Area + Kitchen.Qual + Bedroom.AbvGr + Year.Built + Full.Bath + Half.Bath + Gr.Liv.Area + Central.Air + Foundation + Garage.Area + Garage.Cars+ Garage.Type + Bsmt.Full.Bath + Bsmt.Half.Bath,
data = house_data_subset
)
This Model has adjusted R2 of 0.8366836
SecondModel_Quadratic = lm(SalePrice ~ Fireplaces + TotRms.AbvGrd + Total.Bsmt.SF + Overall.Cond + Overall.Qual + Lot.Area + Kitchen.Qual + Bedroom.AbvGr + Year.Built + Full.Bath + Half.Bath + Gr.Liv.Area + Central.Air + Foundation + Garage.Area + Garage.Cars + Garage.Type + Bsmt.Full.Bath + Bsmt.Half.Bath + I(Fireplaces^2) + I(TotRms.AbvGrd^2) + I(Total.Bsmt.SF^ 2) + I(Overall.Cond^2) + I(Overall.Qual^2) +(Lot.Area^2) + I(Full.Bath^2) + I(Half.Bath^2) + I(Gr.Liv.Area^2) + I(Garage.Area^2)+ I(Garage.Cars^2) +I(Bsmt.Full.Bath^2) + I(Bsmt.Half.Bath^2) , data = house_data_subset)
SecondModel = step(SecondModel_Quadratic,trace = 0)
This Model has adjusted R2 of 0.8790733
ThirdModel = lm(SalePrice ~ (Fireplaces + TotRms.AbvGrd + Total.Bsmt.SF + Overall.Cond + Overall.Qual + Lot.Area + Kitchen.Qual + Bedroom.AbvGr + Year.Built + Full.Bath + Half.Bath + Gr.Liv.Area + Central.Air + Foundation + Garage.Area + Garage.Cars + Garage.Type + Bsmt.Full.Bath + Bsmt.Half.Bath) ^ 2 , data = house_data_subset)
This Model has adjusted R2 of 0.924139
ForthModel = lm(
log(SalePrice) ~ Fireplaces * Total.Bsmt.SF + Overall.Cond + Overall.Qual +
log(Lot.Area) + Kitchen.Qual + Bedroom.AbvGr + Year.Built + log(Gr.Liv.Area) +
Garage.Area + Garage.Cars, data = house_data_subset
)
summary(ForthModel)$adj.r
## [1] 0.8849385
This Model has adjusted R2 of 0.8849385
We are going to analyze the FirstModel, SecondModel , ThirdModel and ForthModel selection process.
par(mfrow=c(4,2))
diagnostics(FirstModel, plotit = TRUE, testit = FALSE)
diagnostics(SecondModel, plotit = TRUE, testit = FALSE)
diagnostics(ThirdModel, plotit = TRUE, testit = FALSE)
diagnostics(ForthModel, plotit = TRUE, testit = FALSE)
M1 = as.numeric(bptest(FirstModel)$p.value)
M2 = bptest(SecondModel)$p.value
M3 = bptest(ThirdModel)$p.value
M4 = bptest(ForthModel)$p.value
val= c(M1,M2,M3,M4)
output = data.frame(
"ModelName" = c("`Model 1`", "`Model 2`", "`Model 3`", "`Model 4`"),"`BPTestPval`" = val
)
output
## ModelName X.BPTestPval.
## 1 `Model 1` 1.048867e-153
## 2 `Model 2` 1.683046e-125
## 3 `Model 3` 6.070092e-65
## 4 `Model 4` 1.615120e-78
The above results shows that the P value is very low, hence we reject homoscedasticity (Constant variance assumption not violated). Also we can see the BP value for the fourth model is better as compared to rest of the models.
diagnostics(FirstModel, plotit = FALSE, testit = TRUE)
## $p_val
## [1] 2.535269e-53
##
## $decision
## [1] "Reject"
diagnostics(SecondModel, plotit = FALSE, testit = TRUE)
## $p_val
## [1] 5.921164e-45
##
## $decision
## [1] "Reject"
diagnostics(ThirdModel, plotit = FALSE, testit = TRUE)
## $p_val
## [1] 2.221852e-34
##
## $decision
## [1] "Reject"
diagnostics(ForthModel, plotit = FALSE, testit = TRUE)
## $p_val
## [1] 8.738884e-41
##
## $decision
## [1] "Reject"
The above Shapiro Test results shows that the P value is very low, hence we reject homoscedasticity (linearity is not a suspect).
#Leverage
lv_m1 = length(hatvalues(FirstModel)[hatvalues(FirstModel) > 2 * mean(hatvalues(FirstModel))])
lv_m2 = length(hatvalues(SecondModel)[hatvalues(SecondModel) > 2 * mean(hatvalues(SecondModel))])
lv_m3 = length(hatvalues(ThirdModel)[hatvalues(ThirdModel) > 2 * mean(hatvalues(ThirdModel))])
lv_m4 = length(hatvalues(ForthModel)[hatvalues(ForthModel) > 2 * mean(hatvalues(ForthModel))])
# Outliers
ol_m1 = length(rstandard(FirstModel)[abs(rstandard(FirstModel)) > 2])
ol_m2 = length(rstandard(SecondModel)[abs(rstandard(SecondModel)) > 2])
ol_m3 = length(rstandard(ThirdModel)[abs(rstandard(ThirdModel)) > 2])
ol_m4 = length(rstandard(ForthModel)[abs(rstandard(ForthModel)) > 2])
# Influential
inf_obs_m1 = length(cooks.distance(FirstModel)[cooks.distance(FirstModel) > 4 / length(cooks.distance(FirstModel))])
inf_obs_m2 = length(cooks.distance(SecondModel)[cooks.distance(SecondModel) > 4 / length(cooks.distance(SecondModel))])
inf_obs_m3 = length(cooks.distance(ThirdModel)[cooks.distance(ThirdModel) > 4 / length(cooks.distance(ThirdModel))])
inf_obs_m4 = length(cooks.distance(ForthModel)[cooks.distance(ForthModel) > 4 / length(cooks.distance(ForthModel))])
#Summarizing The Result
output = data.frame(
"ModelName" = c("`Model 1`", "`Model 2`", "`Model 3`", "`Model 4`"),
"TotalUnfluentialObs" = c(
lv_m1 ,
lv_m2,
lv_m3,
lv_m4
),
"TotalOutliers" = c(
ol_m1 ,
ol_m2,
ol_m3,
ol_m3
),
"InfluentialObs" = c(
inf_obs_m1 ,
inf_obs_m2,
inf_obs_m3,
inf_obs_m4
)
)
knitr::kable(output)
| ModelName | TotalUnfluentialObs | TotalOutliers | InfluentialObs |
|---|---|---|---|
Model 1 |
230 | 88 | 132 |
Model 2 |
258 | 125 | 165 |
Model 3 |
365 | 191 | 272 |
Model 4 |
195 | 191 | 153 |
What we see in all the model Residual vs Residual Plot and Normal QQ plot is that they are not perfect and Test performed above also proved that there is something wrong with our models
salepriceBCMod = caret::BoxCoxTrans(house_data_subset$SalePrice)
house_data_subset = cbind(house_data_subset, salePrice_new=predict(salepriceBCMod, house_data_subset$SalePrice))
LotAreaBCMod = caret::BoxCoxTrans(house_data_subset$Lot.Area)
house_data_subset = cbind(house_data_subset, LotArea_new=predict(LotAreaBCMod, house_data_subset$Lot.Area))
GrLivAreaBCMod = caret::BoxCoxTrans(house_data_subset$Gr.Liv.Area)
house_data_subset = cbind(house_data_subset, GrLivArea_new=predict(GrLivAreaBCMod, house_data_subset$Gr.Liv.Area))
cd_model4 = cooks.distance(ForthModel)
# Remove influential observation and use variable above created from Box Cox Plot
ForthModel_fixed = lm(
salePrice_new ~ Fireplaces + Overall.Cond + Overall.Qual +
LotArea_new + Bedroom.AbvGr + Year.Built + GrLivArea_new +
Garage.Area ,
data = house_data_subset,
subset = cd_model4 <= 4 / length(cd_model4)
)
summary(ForthModel_fixed)$adj.r
## [1] 0.9063674
bptest(ForthModel_fixed)
##
## studentized Breusch-Pagan test
##
## data: ForthModel_fixed
## BP = 66.836, df = 8, p-value = 2.089e-11
diagnostics(ForthModel_fixed, pcol = "grey", lcol = "green")
## $p_val
## [1] 2.885935e-05
##
## $decision
## [1] "Reject"
The ‘FixedModel’ looks very promising in terms of constant variance and linear relationship as compared to previous models.
ncvTest(ForthModel_fixed)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 4.234469, Df = 1, p = 0.039611
spreadLevelPlot(ForthModel_fixed)
##
## Suggested power transformation: -0.04237718
boxcox(ForthModel_fixed, plotit = TRUE)
## Box-Cox Transformation
##
## Estimation method: ml
## Optimal parameter: -0.5414022
## Loglike: -5961.935
##
## Summary of transformed variables
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.329 1.361 1.366 1.367 1.372 1.393
Based on the Results from box cox plot, it suggested us to do a power transformation of the Reponse Variable
cd_model4 = cooks.distance(ForthModel)
# Remove influential observation and use variable above created from Box Cox Plot
ForthModel_fixed = lm(
(salePrice_new ^ -0.04237718) ~ Fireplaces + Overall.Cond + Overall.Qual +
LotArea_new + Bedroom.AbvGr + Year.Built + GrLivArea_new +
Garage.Area ,
data = house_data_subset,
subset = cd_model4 <= 4 / length(cd_model4)
)
summary(ForthModel_fixed)$adj.r
## [1] 0.9071952
bptest(ForthModel_fixed)
##
## studentized Breusch-Pagan test
##
## data: ForthModel_fixed
## BP = 64.379, df = 8, p-value = 6.401e-11
diagnostics(ForthModel_fixed, pcol = "grey", lcol = "green")
## $p_val
## [1] 2.299218e-07
##
## $decision
## [1] "Reject"
Utility Function
rmse = function(actual, predicted) {
sqrt(mean((actual - predicted) ^ 2))
}
get_rmse = function(model, data, response) {
rmse(actual = data[, response],
predicted = predict(model, data))
}
get_complexity = function(model) {
length(coef(model)) - 1
}
house_data_subset_mod = subset(house_data_subset, cd_model4 <=
4/length(cd_model4))
set.seed(108)
n = nrow(house_data_subset_mod)
house_data_subset_idx=sample(nrow(house_data_subset_mod),round(nrow(house_data_subset) / 2))
house_data_subset_trn = house_data_subset_mod[house_data_subset_idx, ]
house_data_subset_tst = house_data_subset_mod[-house_data_subset_idx, ]
ForthModel_fixed_train = lm(
salePrice_new ~ Fireplaces + Overall.Cond + Overall.Qual +
LotArea_new + Bedroom.AbvGr + Year.Built + GrLivArea_new +
Garage.Area ,
data = house_data_subset_trn
)
RMSE for the train and test Model
sqrt(mean((house_data_subset_trn$salePrice_new - predict(ForthModel_fixed_train, house_data_subset_trn)) ^ 2))
## [1] 0.1174843
sqrt(mean((house_data_subset_tst$salePrice_new - predict(ForthModel_fixed_train, house_data_subset_tst)) ^ 2))
## [1] 0.1129396
We have a train RMSE of 0.1148181 and a test RMSE 0.1159567. These are very close and do not seem to infer over or under fitting.
To further analyze ForthModel_fixed we ran forward and backward AIC and BIC on ForthModel_fixed. All four methods returned the same model. Calling this model ForthModel_AIC_BIC we will run an anova test on it as well as compare RMSE of the train and test sets on both models
ForthModel_AIC_BIC = lm(salePrice_new ~ Fireplaces + Overall.Cond + Overall.Qual + LotArea_new + Bedroom.AbvGr + Year.Built + GrLivArea_new,
data = house_data_subset_trn)
anova(ForthModel_fixed_train, ForthModel_AIC_BIC)
## Analysis of Variance Table
##
## Model 1: salePrice_new ~ Fireplaces + Overall.Cond + Overall.Qual + LotArea_new +
## Bedroom.AbvGr + Year.Built + GrLivArea_new + Garage.Area
## Model 2: salePrice_new ~ Fireplaces + Overall.Cond + Overall.Qual + LotArea_new +
## Bedroom.AbvGr + Year.Built + GrLivArea_new
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1441 20.014
## 2 1442 21.868 -1 -1.854 133.49 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA test resturns a very low p-value, as such we prefer the larger model, aka the ForthModel_fixed
output = data.frame(
"ModelName" = c("`ForthModel`", "`ForthModel_AIC_BIC`"),
"Train_RMSE" = c(
forth_trn,
forthaic_trn
),
"Test_RMSE" = c(
forth_tst,
forthaic_tst
)
)
knitr::kable(output)
| ModelName | Train_RMSE | Test_RMSE |
|---|---|---|
ForthModel |
0.1174843 | 0.1129396 |
ForthModel_AIC_BIC |
0.1228056 | 0.1168950 |
output = data.frame(
"ModelName" = c("`Model 1`", "`Model 2`", "`Model 3`", "`Model 4`", "Fixed Model"),
"AdjustedR2" = c(
summary(FirstModel)$adj.r.squared ,
summary(SecondModel)$adj.r.squared,
summary(ThirdModel)$adj.r.squared,
summary(ForthModel)$adj.r.squared,
summary(ForthModel_fixed)$adj.r.squared
)
)
knitr::kable(output)
| ModelName | AdjustedR2 |
|---|---|
Model 1 |
0.8366836 |
Model 2 |
0.8790733 |
Model 3 |
0.9241390 |
Model 4 |
0.8849385 |
| Fixed Model | 0.9071952 |
Fixed model is the leading in terms of Adjusted R2.
RMSE for each modelcalc_loocv_rmse = function(model) {
sqrt(mean((resid(model) / (1 - hatvalues(model))) ^ 2))
}
output = data.frame(
"ModelName" = c("`Model 1`", "`Model 2`", "`Model 3`", "`Model 4`", "`Fixed Model`"),
"RMSE" = c(
calc_loocv_rmse(FirstModel) ,
calc_loocv_rmse(SecondModel),
calc_loocv_rmse(ThirdModel),
calc_loocv_rmse(ForthModel),
calc_loocv_rmse(ForthModel_fixed)
)
)
knitr::kable(output)
| ModelName | RMSE |
|---|---|
Model 1 |
Inf |
Model 2 |
Inf |
Model 3 |
Inf |
Model 4 |
Inf |
Fixed Model |
0.0003638 |
| The fixed Model | has a perfect LOOCV RMSE, as compared to other models which shows infinty. |
# VIF
sort(vif(ForthModel_fixed))
## LotArea_new Overall.Cond Fireplaces Bedroom.AbvGr Garage.Area
## 1.288926 1.291559 1.367961 1.744734 1.785370
## Year.Built Overall.Qual GrLivArea_new
## 2.084127 2.663422 3.110250
sum(vif(FirstModel)>5)/length(coef(FirstModel))
## [1] 0.15625
sum(vif(SecondModel)>5)/length(coef(SecondModel))
## [1] 0.825
sum(vif(ForthModel)>5)/length(coef(ForthModel))
## [1] 0.25
sum(vif(ForthModel_fixed)>5)/length(coef(ForthModel_fixed))
## [1] 0
output = data.frame(
"ModelName" = c("`Model 1`", "`Model 2`", "`Model 3`", "`Model 4`", "`Fixed Model`"),
"AIC" = c(AIC(FirstModel) ,
AIC(SecondModel),
AIC(ThirdModel),
AIC(ForthModel),
AIC(ForthModel_fixed)
)
)
knitr::kable(output)
| ModelName | AIC |
|---|---|
Model 1 |
68409.047 |
Model 2 |
67546.083 |
Model 3 |
66509.444 |
Model 4 |
-3376.961 |
Fixed Model |
-35723.186 |
Once again Fixed Model has the better AIC value.
We choose the ForthModel_fixed model as our final one, It is best in terms of Adjusted R2 and Loocv RMSE.
Let’s see which predictors we finally have :
Final_Model = ForthModel_fixed
length(coef(Final_Model))
## [1] 9
names(coef(Final_Model))
## [1] "(Intercept)" "Fireplaces" "Overall.Cond" "Overall.Qual"
## [5] "LotArea_new" "Bedroom.AbvGr" "Year.Built" "GrLivArea_new"
## [9] "Garage.Area"
Let’s see if we can improve further and see how it performs using step function.
final_model_v2 = step(Final_Model, trace = 0)
par(mfrow = c(2,2))
plot(final_model_v2, main = "Final Model V2")
###Evaluation
output = data.frame(
"ModelName" = c("`Final Model`", "`Final Model v2`"),
"AdjustedR2" = c(
summary(Final_Model)$adj.r.squared,
summary(final_model_v2)$adj.r.squared
),
"RMSE" = c(
calc_loocv_rmse(Final_Model),
calc_loocv_rmse(final_model_v2)
),
"AIC" = c(
abs(AIC(Final_Model)) ,
abs(AIC(final_model_v2))
),
"BIC" = c(
abs(AIC(Final_Model, k=log(nrow(house_data_subset)))),
abs(AIC(final_model_v2, k=log(nrow(house_data_subset))))
)
)
knitr::kable(output)
| ModelName | AdjustedR2 | RMSE | AIC | BIC |
|---|---|---|---|---|
Final Model |
0.9071952 | 0.0003638 | 35723.19 | 35663.46 |
Final Model v2 |
0.9071952 | 0.0003638 | 35723.19 | 35663.46 |
Based on the results above, we can conclude that we cannot make any further improvements to the model, Hence Final_Model is our final and best model
Final_Model is our chosen model to do house price prediction, we have below observations:
summary(Final_Model)
##
## Call:
## lm(formula = (salePrice_new^-0.04237718) ~ Fireplaces + Overall.Cond +
## Overall.Qual + LotArea_new + Bedroom.AbvGr + Year.Built +
## GrLivArea_new + Garage.Area, data = house_data_subset, subset = cd_model4 <=
## 4/length(cd_model4))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.424e-03 -2.405e-04 -3.750e-06 2.275e-04 1.758e-03
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.420e-01 7.552e-04 1247.41 <2e-16 ***
## Fireplaces -1.453e-04 1.273e-05 -11.41 <2e-16 ***
## Overall.Cond -1.788e-04 7.395e-06 -24.18 <2e-16 ***
## Overall.Qual -2.925e-04 8.337e-06 -35.08 <2e-16 ***
## LotArea_new -4.032e-04 1.569e-05 -25.70 <2e-16 ***
## Bedroom.AbvGr 1.169e-04 1.141e-05 10.24 <2e-16 ***
## Year.Built -1.275e-05 3.406e-07 -37.45 <2e-16 ***
## GrLivArea_new -1.430e-03 3.885e-05 -36.80 <2e-16 ***
## Garage.Area -6.730e-07 4.481e-08 -15.02 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0003631 on 2739 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.9075, Adjusted R-squared: 0.9072
## F-statistic: 3358 on 8 and 2739 DF, p-value: < 2.2e-16
Variance Assumption with NVCTEST
ncvTest(ForthModel_fixed)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 2.963276, Df = 1, p = 0.085175
# plot studentized residuals vs. fitted values
spreadLevelPlot(ForthModel_fixed)
##
## Suggested power transformation: 1.42485
The significant predictors.
summary(Final_Model)$coefficients[summary(Final_Model)$coefficients[ ,4] < 0.05,]
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.420242e-01 7.551853e-04 1247.40792 0.000000e+00
## Fireplaces -1.452644e-04 1.273207e-05 -11.40933 1.721612e-29
## Overall.Cond -1.788365e-04 7.395033e-06 -24.18333 2.864186e-117
## Overall.Qual -2.924783e-04 8.336759e-06 -35.08297 5.051918e-223
## LotArea_new -4.031799e-04 1.568608e-05 -25.70305 1.051763e-130
## Bedroom.AbvGr 1.168619e-04 1.140782e-05 10.24401 3.418634e-24
## Year.Built -1.275450e-05 3.406098e-07 -37.44609 3.503978e-248
## GrLivArea_new -1.429505e-03 3.884768e-05 -36.79771 3.189816e-241
## Garage.Area -6.729665e-07 4.481273e-08 -15.01731 4.802337e-49
Below is the accuracy of the model
summary(Final_Model)$r.squared
## [1] 0.9074655
SO WE SEE OUR MADE UP MODELS UNDERFIT AND OUR FINAL MODEL WAS STILL THE BEST AND THE FORWARD AIC AND BACKWARS AIC OFF THAT MODEL WAS STILL WORSE THAN THE ForthModel_fixed
This project is being contributed by below team members: